Statistical Part-of-Speech Tagger for Traditional Arabic Texts

نویسنده

  • Yahya O. Mohamed Elhadj
چکیده

Problem statement: This study presented the development of an Arabic part-of-speech tagger that can be used for analyzing and annotating traditional Arabic texts, especially the Quran text. Approach: It is a part of a project related to the computerization of the Holy Quran. One of the main objectives in this project was to build a textual corpus of the Holy Quran. Results: Since an appropriate textual version of the Holy Quran was prepared and morphologically analyzed in other stages of this project, we focused in this work on its annotation by developing and using an appropriate tagger. The developed tagger employed an approach that combines morphological analysis with Hidden Markov Models (HMMs) based-on the Arabic sentence structure. The morphological analysis is used to reduce the size of the tags lexicon by segmenting Arabic words in their prefixes, stems and suffixes; this is due to the fact that Arabic is a derivational language. On another hand, HMM is used to represent the Arabic sentence structure in order to take into account the linguistic combinations. For these purposes, an appropriate tagging system has been proposed to represent the main Arabic part of speech in a hierarchical manner allowing an easy expansion whenever it is needed. Each tag in this system is used to represent a possible state of the HMM and the transitions between tags (states) are governed by the syntax of the sentence. A corpus of some traditional texts, extracted from Books of third century (Hijri), is manually morphologically analyzed and tagged using our developed tagset. Conclusion/Recommendations: It is then used for training and testing this model. Experiments conducted on this dataset gave a recognition rate of about 96% and thus are very promising compared to the data size tagged till now and used in the training. Since our Holy Quran corpus is still under revision, we did not make significant experiments on it. However, preliminary tests conducted on the seven verses of AL-Fatiha showed an encouraging accuracy rate.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Topic Identification In Chinese Based On Centering Model

In this paper we are concerned with identifying the topics of sentences in Chinese texts. The key elements of the centering model of local discourse coherence are employed to identify the topic which is the most salient element in a Chinese sentence. Due to the phenomenon of zero anaphora occurring in Chinese texts frequently, in addition to the centering model, we further employ the constraint...

متن کامل

Morphological Analysis and Disambiguation for Dialectal Arabic

The many differences between Dialectal Arabic and Modern Standard Arabic (MSA) pose a challenge to the majority of Arabic natural language processing tools, which are designed for MSA. In this paper, we retarget an existing state-of-the-art MSA morphological tagger to Egyptian Arabic (ARZ). Our evaluation demonstrates that our ARZ morphology tagger outperforms its MSA variant on ARZ input in te...

متن کامل

Fine-Grain Morphological Analyzer and Part-of-Speech Tagger for Arabic Text

Morphological analyzers and part-of-speech taggers are key technologies for most text analysis applications. Our aim is to develop a part-of-speech tagger for annotating a wide range of Arabic text formats, domains and genres including both vowelized and non-vowelized text. Enriching the text with linguistic analysis will maximize the potential for corpus re-use in a wide range of applications....

متن کامل

POS Tagging of Dialectal Arabic: A Minimally Supervised Approach

Natural language processing technology for the dialects of Arabic is still in its infancy, due to the problem of obtaining large amounts of text data for spoken Arabic. In this paper we describe the development of a part-of-speech (POS) tagger for Egyptian Colloquial Arabic. We adopt a minimally supervised approach that only requires raw text data from several varieties of Arabic and a morpholo...

متن کامل

Multi-word Term Extraction Based on New Hybrid Approach for Arabic Language

Arabic Multiword Term are relevant strings of words in text documents. Once they are automatically extracted, they can be used to increase the performance of any text mining applications such as Categorisation, Clustering, Information Retrieval System, Machine Translation, and Summarization, etc. This paper introduces our proposed Multiword term extraction system based on the contextual informa...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2009